In educational technology and learning sciences, there are multiple uses for a predictive model of whether
a student will perform a task correctly or not. For example, an intelligent tutoring system may use such
a model to estimate whether or not a student has mastered a skill. We analyze the significance of data
recency in making such predictions, i.e., asking whether relatively more recent observations of a student’s
performance matter more than relatively older observations. We develop a new Recent-Performance
Factors Analysis model that takes data recency into account. The new model significantly improves
predictive accuracy over both existing logistic-regression performance models and over novel baseline
models in evaluations on real-world and synthetic datasets. As a secondary contribution, we demonstrate
how the widely used cross-validation with 0-1 loss is inferior to AIC and to cross-validation with L1
prediction error loss as a measure of model performance.


1. Introduction 


A central field of research in educational technology and assessment is concerned with modeling
the probability that a student will respond correctly to some question. This modeling is used to
analyze test answers, as with Item Response Theory; in adaptive learning technologies, such
as the use of Bayesian Knowledge Tracing (Corbett and Anderson, 1995) in intelligent tutoring
systems; to analyze the domains that students study, such as the study of transfer across tasks
(Pavlik et al., 2011); and to understand student behaviors like gaming the system (Baker et al.,
2004).


Our work advances this field by examining alternative representations of recency. The intuition
is simple: as students practice a skill, we expect their understanding to increase and their
performance to improve. Having recently succeeded at a task may make it more likely that
learning has taken place, and such a moment of learning ought to contribute to our prediction
of successful performance. This work is the first thorough investigation of recency effects in
performance modeling.

We begin by describing a space of models of recency that fits into the logistic regression
approach to performance modeling, as exemplified by Item Response Theory models. This
space subsumes many existing modeling efforts, including the Additive Factors Model (AFM)
(Cen et al., 2006a), Performance Factors Analysis (PFA) (Pavlik et al., 2009), and the
recency-weighted model by Gong and colleagues (Gong et al., 2011). We then propose the
Recent-Performance Factors Analysis (R-PFA) model. We evaluate this model’s accuracy on a
real-world dataset of student performance from the Assistments system (Baker et al., 2011). Finally,
since real-world datasets exhibit certain data limitations, we further examine the properties of
the new R-PFA model and several alternatives on a range of simulated datasets.


2. Prior Work in Performance Modeling 


To predict whether or not a student will succeed at completing a task, at a minimum, we ought to
take into account some characteristic of the student and the task. There are two chief approaches
to such modeling in the literature: graphical models, notably including Bayesian Knowledge
Tracing, and logistic regression models.

The best-known examples of logistic regression modeling, Item Response Theory and
Rasch models, include predictors relating to the ability of the student and the difficulty of the
task. A refinement on this approach is Linear Logistic Test Models (LLTM) (Fischer, 1973;
de Boeck and Wilson, 2004). These logistic regression models replace the predictor relating
to the difficulty of the individual tasks with a predictor that groups together tasks that share
an underlying skill or Knowledge Component (KC). Because task difficulty is estimated from
data, replacing per-task parameters with per-skill parameters reduces the number of model
parameters, and leverages the power of task-level observations to provide a relatively more robust
estimate of skill difficulty.

The LLTM class of models includes the Additive Factors Model (AFM) and Performance Factors
Analysis (PFA) (Cen et al., 2006a; Pavlik et al., 2009; Chi et al., 2011). These models differ
only in how they reflect prior practice to predict a student’s future performance. The original
LLTM is meant to reflect student knowledge during a short examination where we assume no
learning is occurring, and therefore it does not include any summaries for the effects of practice,
only effects of student and skill (i.e., student and KC intercepts). AFM introduces a slope
coefficient for the total number of prior opportunities a student has had to practice a KC. The claim
is that the more practice a student has had, the more likely they should be to get the next item
correct. PFA decomposes the number of total prior practice opportunities into separate counts of
successes and failures, on the assumption that successful and unsuccessful practice may have
differential value for student learning and thus for probability of correctness on the next task.


3. Model Comparison and Model Design 

Because performance modeling is rooted in statistics and machine learning, models of
performance are often evaluated in terms of predictive accuracy. We take the position that models of
task performance need to be interpretable above all. This stance disfavors models with good
predictive accuracy when such models are black boxes, because such models make it difficult to
advance the science of learning, or to develop systems that act on model predictions.
Nonetheless, predictive accuracy is a sensible way to choose among multiple interpretable models.

We consider interpretability in terms of model realism and complexity. Model realism
considers that many aspects of the world may affect student performance; a model should reflect as
many of the biggest effects as possible, and do so as accurately or plausibly as possible in terms
of both structure and estimated parameters. The danger of realistic models is that they can be
highly complex. Models that are excessively complex may not be fully identified, or they may
“overfit” the available data, i.e., they may reflect data characteristics that are minor at best, or
inaccurate at worst.

By way of example, we can place Bayesian Knowledge Tracing (BKT) and PFA on the
realism-complexity continuum. BKT has a generative structure that represents (to a degree)
human learning, but this structure also leads to mathematical complexity and may lead to
implausible parameter values (Beck and Chang, 2007). PFA is relatively simpler mathematically
because of its linear structure, but may still yield implausible parameters. For example, unless
the parameters are artificially restricted, PFA may estimate that practice on a skill is associated
with a decrease in the probability that a student will correctly answer a problem on that skill.
BKT and PFA have comparable (and mediocre) predictive accuracy (Pavlik et al., 2009; Gong
et al., 2011).


Accordingly, we aim for a model that is realistic, not excessively complex, and with good
predictive accuracy. This is no small goal; for instance, models may have similar predictive
accuracy but different parameter interpretations. For example, Kaser et al. (2014) find that
AFM only estimates positive slopes for practice on about 50% of the skills whereas other models
estimate positive slopes for practice of almost all skills. However, if improvement in parameter
plausibility does not lead to reliable improvement in predictive accuracy (Kaser et al., 2014), it
is hard to decide which model is preferable.

We use AIC as an operational definition of model quality. AIC is a likelihood-based
measure of model accuracy that incorporates a penalty for model complexity. Minimizing AIC is
equivalent to minimizing KL-divergence risk. An alternative technique for model comparison
is cross-validation. An especially important technique in educational data mining is
student-stratified cross-validation, where a model is trained on one set of students, and used to make
predictions for a held-out set of students. In this way, one can claim to have a reasonable
expectation of how well the model will perform on entirely new students. Still, AIC is known to
be asymptotically equivalent to cross-validation (Akaike, 1985; Wasserman, 2004; James et al.,
2013). In fact, we demonstrate later that AIC is superior as a measure of model fit to the
oft-used cross-validation with a 0-1 loss function.
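As a rough illustration of how AIC rewards confident, accurate predictions, the following Python sketch computes AIC from a plain Bernoulli likelihood. It is a simplification under stated assumptions: it ignores the random-effects structure of the models in this paper, and the function names, the toy outcomes, and the parameter count of 3 are hypothetical values chosen only for illustration.

```python
import math

def bernoulli_log_likelihood(p_hats, outcomes):
    """Log-likelihood of binary outcomes under predicted success probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for p, y in zip(p_hats, outcomes)
    )

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical outcomes and predictions, for illustration only.
outcomes = [0, 1, 1, 0, 1]
hedging_model = [0.45, 0.55, 0.55, 0.45, 0.55]    # right side of 0.5, low confidence
confident_model = [0.10, 0.90, 0.90, 0.10, 0.90]  # right side of 0.5, high confidence

aic_hedging = aic(bernoulli_log_likelihood(hedging_model, outcomes), n_params=3)
aic_confident = aic(bernoulli_log_likelihood(confident_model, outcomes), n_params=3)
print(aic_hedging, aic_confident)
```

Both toy models classify every response on the correct side of 0.5, so a 0-1 loss cannot separate them, while the likelihood term in AIC favors the confident one.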

One way to consider the distinction between AIC and cross-validation is to consider that
these measures represent different loss functions. Cross-validation can use any loss function,
but 0-1 loss is most common, while the KL-divergence that AIC uses is more similar to an L1
prediction error (PE) loss.

Let L be the loss function, let $Y_{ij}$ be the actual correct (1) or incorrect (0) outcome for student
i on a practice opportunity on skill j, and let $\hat{p}_{ij} = P(Y_{ij} = 1)$ be the estimated probability of a
correct answer (i.e., the continuous output from the logistic regression model).

0-1 loss:   $L(\hat{p}_{ij}, Y_{ij}) = \begin{cases} 0 & \text{if } |\hat{p}_{ij} - Y_{ij}| < 0.5 \\ 1 & \text{otherwise} \end{cases}$    (1)

PE loss:    $L(\hat{p}_{ij}, Y_{ij}) = |\hat{p}_{ij} - Y_{ij}|$    (2)


The primary difference between these loss functions is in whether we are interested in only
prediction accuracy, or in accuracy and model confidence. The more confident a particular
model is in its predictions, the closer the estimated probability of a correct response, $\hat{p}_{ij}$, will
be to the actual student response. Under PE loss, a hypothetical model that is accurate but not
confident in its prediction is considered to perform worse than a model that is accurate and
confident. By contrast, 0-1 loss discards information on model confidence, and treats confident
and non-confident models equally. When $\hat{p}_{ij}$ is near 0.5, the two measures may disagree. The
0-1 loss function may prefer a model that has high predictive accuracy even if $\hat{p}$ is near 0.5, i.e.,
even if the model is not confident in its predictions. As model confidence increases, agreement
in model ranking between the two loss functions will increase.

For example, suppose that for a particular individual with two opportunities for practice on
KC j, we observe $Y_{ij}$ = (0, 1). Now suppose that Model 1 estimates the probability of a correct
response as $\hat{p}$ = (0.48, 0.52), while Model 2 estimates the probability of a correct response as
$\hat{p}$ = (0.1, 0.9). Then 0-1 loss will not distinguish between these two models, but PE loss will
prefer the model that is more confident in predicting that the first-attempt response is incorrect
and the second-attempt response is correct.
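The two-model comparison above can be checked with a few lines of Python. This is a minimal sketch of the two loss functions as defined in equations (1) and (2); the function and variable names are illustrative choices, and averaging the loss over the two attempts is an assumption made here for convenience.

```python
def zero_one_loss(p_hat, y):
    """0-1 loss: 0 if the prediction is within 0.5 of the outcome, else 1."""
    return 0 if abs(p_hat - y) < 0.5 else 1

def pe_loss(p_hat, y):
    """L1 prediction-error loss: absolute gap between probability and outcome."""
    return abs(p_hat - y)

def mean_loss(loss, p_hats, outcomes):
    """Average a per-observation loss over a sequence of attempts."""
    return sum(loss(p, y) for p, y in zip(p_hats, outcomes)) / len(outcomes)

outcomes = [0, 1]        # observed responses from the example
model_1 = [0.48, 0.52]   # accurate but not confident
model_2 = [0.10, 0.90]   # accurate and confident

for name, p_hats in [("Model 1", model_1), ("Model 2", model_2)]:
    print(name,
          mean_loss(zero_one_loss, p_hats, outcomes),
          round(mean_loss(pe_loss, p_hats, outcomes), 2))
```

Both models receive a mean 0-1 loss of 0, while the mean PE loss is 0.48 for Model 1 and 0.1 for Model 2, so only PE loss separates them.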

When cross-validation uses 0-1 loss, it ignores model confidence, but AIC considers both
model accuracy and model confidence. The significance of this distinction will become apparent
in the model comparisons below.


3.1. Recent-Performance Factors Analysis 


Recent-Performance Factors Analysis (R-PFA) focuses on recent history, rather than the
complete practice history of a student (Galyardt and Goldin, 2014). The first intuition behind this
model is that having learned a skill makes it more likely that the student will get the next item
correct; not having learned the skill makes an incorrect response more likely than a correct one.
The second intuition is that recent practice history with a KC may contain all the necessary
information about whether or not a student has acquired the KC. We can relate this idea to a
‘moment-of-learning’. If a student has been successful with recent practice, then a
moment-of-learning has likely already occurred. If recent attempts have not been successful, then the
student has most likely not yet learned the KC.


3.2. Formal Model Descriptions 


To evaluate R-PFA comprehensively, we examine a number of alternative models. The notation
we use differs from other publications of some models, but we hope that our consistent use of
notation across all models will facilitate the comparison. We use the following notation:

$j$          KC index, $j = 1, \ldots, J$
$i$          student index, $i = 1, \ldots, N$
$t$          practice opportunity index, $t = 1, \ldots, O_{ij}$
$X_{ijt}$    response by student i, on opportunity t of KC j (0 if incorrect, 1 if correct)
$p_{ijt}$    probability of a correct response: $\Pr(X_{ijt} = 1)$
$T_{ijt}$    count of past opportunities
$S_{ijt}$    recency-weighted count of previous successes, up to trial t
$F_{ijt}$    recency-weighted count of previous failures, up to trial t
$R_{ijt}$    recency-weighted proportion of past successes


All the models we examine are logistic regressions of a common general form, in which generic
predictors Z summarize a student’s prior practice. Each of the main models that we examine uses
a different representation of a student’s prior practice. These terms, which replace the generic
Z’s in the general form, are displayed for clear comparison in Table 1. The previously published
models are the Additive Factors Model (AFM) (Cen et al., 2007; Cen et al., 2008), Performance
Factors Analysis (PFA) (Pavlik et al., 2009), and PFA-decay (Gong et al., 2011). We additionally
include baseline models S-only, R-only, and R-AFM.

Table 1: Terms in predictive model variants.

Model    Student      KC           Success                Failure              Total                  Recent
         ability      difficulty   count                  count                trials                 success rate
AFM      $\theta_i$   $\beta_j$    -                      -                    $\gamma_j T_{ijt}$     -
PFA      $\theta_i$   $\beta_j$    $\alpha_j S_{ijt}$     $\rho_j F_{ijt}$     -                      -
S-only   $\theta_i$   $\beta_j$    $\alpha_j S_{ijt}$     -                    -                      -
R-only   $\theta_i$   $\beta_j$    -                      -                    -                      $\delta_j R_{ijt}$
R-AFM    $\theta_i$   $\beta_j$    -                      -                    $\gamma_j T_{ijt}$     $\delta_j R_{ijt}$
R-PFA    $\theta_i$   $\beta_j$    -                      $\rho_j F_{ijt}$     -                      $\delta_j R_{ijt}$

AFM represents prior practice as the total number of prior opportunities for a student to
practice the KC:

$\mathrm{logit}(p_{ijt}) = \theta_i + \beta_j + \gamma_j T_{ijt}$.    (4)

PFA distinguishes effects of prior successes and prior failures in predicting future success:

$\mathrm{logit}(p_{ijt}) = \theta_i + \beta_j + \alpha_j S_{ijt} + \rho_j F_{ijt}$.    (5)


PFA-decay (Gong et al., 2011) is an adjustment to PFA that uses a decay weight d to account
for recency of observations:

$S_{ijt} = \sum_{p=1}^{t-1} d^{(t-1)-p} X_{ijp}$    (6)

$F_{ijt} = \sum_{p=1}^{t-1} d^{(t-1)-p} (1 - X_{ijp})$    (7)

Aside from the decay weight, PFA-decay uses the same predictors S and F as original PFA. In
fact, when d = 1, PFA-decay and PFA are exactly the same. Thus, we refer to both these models
that only include (possibly decayed) counts S and F as PFA.

Another common approach in general regression modeling is to perform a logarithmic
transformation on count variables. In educational applications (e.g., Yudelson et al., 2014; Chi et al.,
2011), a logarithmic transformation of practice counts is an argument that practice beyond some
threshold amount has only a marginal effect on the probability of a correct response. The
transformation represents a sensible, realistic intuition about performance, but the regression
coefficient on a log-transformed count is difficult to interpret. Moreover, the logarithmic
transformation is simply a down-weighting of the total amount of practice; it does not account for any
recency effects.

As an alternative for the count of successes, we introduce an exponentially decayed
proportion of successes, Rijt:

$R_{ijt} = \dfrac{\sum_{p=-2}^{t-1} d^{(t-1)-p} X_{ijp}}{\sum_{p=-2}^{t-1} d^{(t-1)-p}}$    (8)


Aside from the decay weighting, which is explained below, the proportion of successes is quite 
simply the count of prior successes divided by the count of total prior attempts. 

There are two issues to consider in decay weighting: the weighting function (kernel) and
the weight strength. In non-parametric methods, the choice of kernel is generally less important
than the decay weight (Wasserman, 2006). An example of an alternate weighting function is the
box kernel, where the tuning parameter is window size k in the sense of the ‘last k attempts’.
The interpretation is simple, but the box kernel treats all attempts within k as equally important,
which may not be sensible. In the PFA model, k covers the entire practice history, which weighs
all attempts as equally important, and does not discard even the oldest evidence.

R-PFA and PFA-decay (Gong et al., 2011) both place an exponential decay weight on prior
practice. Importantly, Gong et al. (2011) fix d at 0.9, aiming not to “eliminate the effects of
further practices too quickly.” This is an overly simplistic choice, and as we shall demonstrate
in section 4.2, simply tuning the decay parameter in the PFA models appropriately produces
large improvements in predictive accuracy. In exponential weighting, different values of the
decay weight d control the ‘smoothing’ of Sijt (PFA-decay) and Rijt (R-PFA) over the history
of practice. If d = 1, then a student’s entire history of practice gets equal weight. Alternatively, if
d = 0.1, then 90% of the weight is on the single previous trial, and 9% is placed on the 2nd most
recent attempt, so that effectively only the last attempt is counted in the recent history. Choosing
a weight d is precisely analogous to choosing smoothing bandwidth in nonparametric statistics
(e.g., Wasserman, 2006, figure 4.5). For exponential decay, the decay parameter d ranges from
0 to 1, while for the box kernel, the window size k ranges from 1 to infinity, so that selecting
the optimal d has a more tractable search space. Thus the exponential decay function has both
a computational advantage for tuning the decay weight and an interpretability advantage, since older
evidence is down-weighted.

We can see the effect of the different values of d most clearly in the pattern “Student Slips
Twice” (Figure 1). This student has the attempt history Xij = (0, 1, 1, 1, 0, 0, 1, 1, 1), as
indicated by the red diamonds in the figure. With a decay weight d = 0.2, after 1 error followed
by 3 corrects, Sij5 = 1.24, but then when the student misses the next item, Sij6 drops to 0.248.
Without decay, i.e., when d = 1.0, Sijt grows slowly with each item that a student gets right,
and never decreases after errors. For the highlighted decay weight d = 0.7, Sijt increases at a
moderate pace with each correct, until Sij5 = 2.19. Then, when the student answers incorrectly
on trial 5, Sij6 = 1.53, a small drop, and with the subsequent error drops further to Sij7 = 1.07.
The impact is similar on the proportion of successes Rijt.
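The values quoted for the “Student Slips Twice” pattern can be reproduced with a short Python sketch. It assumes that the most recent response receives weight d^0 = 1 and each older response is multiplied by a further factor of d, and that the proportion R is padded with three incorrect “ghost” attempts; the function names are illustrative rather than taken from the posted analysis code.

```python
def decayed_success_count(history, d):
    """S for the next trial: the most recent response has weight 1,
    and each step further back multiplies the weight by d."""
    return sum(d ** age * x for age, x in enumerate(reversed(history)))

def decayed_success_proportion(history, d, n_ghosts=3):
    """R for the next trial: decayed successes divided by decayed attempts,
    with n_ghosts incorrect attempts assumed before the first real one."""
    padded = [0] * n_ghosts + list(history)
    weights = [d ** age for age in range(len(padded))]
    weighted_successes = sum(w * x for w, x in zip(weights, reversed(padded)))
    return weighted_successes / sum(weights)

history = [0, 1, 1, 1, 0, 0, 1, 1, 1]  # "Student Slips Twice"

for d in (0.2, 0.7, 1.0):
    # history[:t - 1] holds the responses observed before trial t
    s_values = [round(decayed_success_count(history[:t - 1], d), 3) for t in (5, 6, 7)]
    r_values = [round(decayed_success_proportion(history[:t - 1], d), 3) for t in (5, 6, 7)]
    print(f"d={d}: S at trials 5-7 = {s_values}, R at trials 5-7 = {r_values}")
```

With d = 0.2 the count falls from 1.24 to 0.248 after the first slip, and with d = 0.7 it moves from 2.19 to about 1.53 and then 1.07, matching the values in the text.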

The behavior of Sijt and Rijt with d = 0.7 mirrors our intuition. If we were tutoring a
student one-on-one, on the third correct attempt in a row, we might think ‘Ok, they’ve mastered
this skill.’ When the next attempt is incorrect, we might think ‘That was probably just a slip.’ On
the second incorrect attempt, we might revise our assessment of the student’s knowledge: ‘Hmm,
maybe they don’t know this.’ But after 3 subsequent correct responses in a row, we might be
fairly convinced the student has learned the KC. This parallels exactly the Bayesian updating of
the probability that a student has learned a KC that takes place in a Bayesian Knowledge Tracing
(BKT) model. In this way, exponential decay weighting is capturing student performance in a
similar way to BKT, but without the complexity of a Hidden Markov Model.

[Figure 1 panels: “Student Slips Twice”, “Inconsistent Student”, “Student Slips Once”, “Student
Never Slips”; horizontal axis: Trial.]

Figure 1: Effect of exponential decay weighting on the count of successes Sijt and proportion
of successes Rijt given distinct patterns of student behavior. Red diamonds signify correct and
incorrect responses (at 1.0 and 0.0 on the vertical axis), and black circles signify the “ghost”
attempts. Each line indicates a different value of the decay parameter d, the thick red line indicating
d = 0.7.

The recency-weighted proportion of successes R is similar to the recency-weighted count of
successes S, but there are differences. The interpretation of Rijt is consistent across different
values of the decay parameter. If Rijt is near 1, then the student has been successful in recent
attempts; if it is near zero, then the student has recently been unsuccessful. If the student has a
fully successful history of practice, R will converge to 1 no matter the value of the decay weight.
By contrast, S does not have a consistent interpretation. For any value of d, R is scaled to fall
between 0 and 1, implying that R is easily interpretable as some proportion, e.g., ‘a student has a
success rate of about 80% over the last few items.’ By contrast, S has asymptotic properties that
complicate interpretation. Since each Xijt is either 1 or 0, the counts Sijt and Fijt are bounded
by the geometric series $\sum_{k=0}^{\infty} d^k = (1 - d)^{-1}$. If d = 1, the series does not converge. For every
d < 1, the series will converge to a different number. The asymptotic limit of this series is visible
in the pattern “Student Never Slips” in Figure 1. For d = 0.9 the limit is $(1 - 0.9)^{-1} = 10$; for
d = 0.2, the limit is 1.25. The meaning of a particular value of S, but not R, depends on d.
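The two limits quoted above follow from the geometric series and can be confirmed numerically. The sketch below is illustrative only; the helper name and the choice of 200 consecutive successes are assumptions made for the check.

```python
def never_slips_count(n_successes, d):
    """Decayed success count S after n_successes correct responses in a row."""
    return sum(d ** k for k in range(n_successes))

for d in (0.2, 0.9):
    approx = never_slips_count(200, d)  # long run of successes
    limit = 1.0 / (1.0 - d)             # closed-form geometric limit
    print(f"d={d}: S after 200 successes = {approx:.6f}, limit = {limit:.6f}")
```

The same count S = 1.25 is therefore the ceiling when d = 0.2 but a low value when d = 0.9, which is why S cannot be read on a common scale across decay weights.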

The consistent interpretation of R also allows us to interpret R as a proxy for whether or
not a student has experienced a moment of learning. As an example, consider two students with
histories Xij = {0, 0, 1, 1, 1, 1} and Xi'j = {0, 0, 1, 1, 1, 1, 1, 1, 1, 1}. Intuitively, we would tend
to believe that both of these students have experienced a moment of learning and are likely to
get the next item correct. For any value of d, these two students will have a similar R value.
However, for a small value of d, S will be the same for these two students, but for a d closer
to 1, S will be different, and the predictions will be different. Thus S is not interpretable as a
proxy for a moment of learning. This gives an interpretative advantage to the recency-weighted
proportion R, and may or may not give a predictive advantage as well.

Early observations of practice on a skill necessarily contain less evidence of student mastery
than the accumulation of early and later observations. Thus, both Rijt and Sijt are noisy on early
attempts on a KC. To illustrate, consider two students: the first student has the performance
history Xij = (1, 0). The second student has the performance history Xi'j = (0, 1, 1, 1, 1).
We would be highly doubtful that the former student has mastered the KC, while the latter
student has likely mastered the KC. After one trial, the proportion of recent successes is 1 and
0, respectively. The first student has a higher proportion of success after the first trial than the
second student does after 5 trials, including 4 successful attempts in a row. Thus, the proportion
of successes is a noisy representation for the first student.

To adjust for this noise on the first few attempts, Rijt incorporates the assumption that 3
attempts prior to the first attempt would have been incorrect. That is, we stipulate ghost attempts
$X_{i,j,-2} = X_{i,j,-1} = X_{i,j,0} = 0$. This makes explicit an assumption that at time 0, a student
has not already learned the KC, which is very plausible in educational data. These ghost attempts
only affect the calculation of Rijt, i.e., they do not enter any other predictor, and they are not
extra instances in the dataset. Note that such ghost attempts implicitly also exist in the
calculation of Sijt, in the sense that on trial one, the count of previous successes is zero. The
ghost attempts are included in equation (8) and Figure 1.
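The effect of the ghost attempts on short practice strings can be illustrated with the two students described above. The sketch below is a hedged illustration: the function name is hypothetical, d = 0.7 is used only as an example value, and setting the number of ghosts to 0 is a device for comparison rather than a model considered in this paper.

```python
def recent_success_proportion(history, d, n_ghosts):
    """Decay-weighted proportion of successes, padded with n_ghosts
    incorrect attempts placed before the first observed attempt."""
    padded = [0] * n_ghosts + list(history)
    weights = [d ** age for age in range(len(padded))]
    weighted_successes = sum(w * x for w, x in zip(weights, reversed(padded)))
    return weighted_successes / sum(weights)

d = 0.7
first_student = [1]               # first student, after one trial
second_student = [0, 1, 1, 1, 1]  # second student, after five trials

for n_ghosts in (0, 3):
    r_first = recent_success_proportion(first_student, d, n_ghosts)
    r_second = recent_success_proportion(second_student, d, n_ghosts)
    print(f"ghosts={n_ghosts}: first={r_first:.3f}, second={r_second:.3f}")
```

Without ghost attempts the single lucky success yields a proportion of 1, above the second student; with three ghost attempts the ordering reverses, so the student with four successes in a row is rated higher.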

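Model Variants Including R. To separate the effects of recent practice, total practice,
and the differential predictive effects of recent success and failure, we compare three model
variants that contain R:

R-only:   $\mathrm{logit}(p_{ijt}) = \theta_i + \beta_j + \delta_j R_{ijt}$    (9)

R-AFM:    $\mathrm{logit}(p_{ijt}) = \theta_i + \beta_j + \gamma_j T_{ijt} + \delta_j R_{ijt}$    (10)

R-PFA:    $\mathrm{logit}(p_{ijt}) = \theta_i + \beta_j + \rho_j F_{ijt} + \delta_j R_{ijt}$.    (11)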

We compare these three recent-history models with the established AFM and PFA models,
as well as the S-only baseline model that uses only the count of successes. For PFA and R-PFA,
which include two decay-weighted variables, we consider both the case where the tuning
parameters are equal and the case where they are tuned separately. This allows for the potentially
differential predictive power of recent successes vs. recent failures.
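To show how the three R-based variants turn their terms into a predicted probability, here is a minimal Python sketch. All numeric values for the parameters and predictors are hypothetical and chosen only for illustration; they are not estimates from this paper, and the function names are likewise assumptions.

```python
import math

def inverse_logit(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_correct(theta_i, beta_j, practice_terms):
    """Probability of a correct response: inverse logit of student ability,
    KC easiness, and the sum of (coefficient, predictor) practice terms."""
    linear = theta_i + beta_j + sum(coef * value for coef, value in practice_terms)
    return inverse_logit(linear)

# Hypothetical parameter and predictor values, for illustration only.
theta_i, beta_j = 0.2, -0.5
gamma_j, rho_j, delta_j = 0.05, -0.8, 2.0
t_ijt, f_ijt, r_ijt = 6, 0.9, 0.7

p_r_only = predict_correct(theta_i, beta_j, [(delta_j, r_ijt)])
p_r_afm = predict_correct(theta_i, beta_j, [(gamma_j, t_ijt), (delta_j, r_ijt)])
p_r_pfa = predict_correct(theta_i, beta_j, [(rho_j, f_ijt), (delta_j, r_ijt)])
print(p_r_only, p_r_afm, p_r_pfa)
```

Passing the practice terms as a list keeps the student and KC intercepts fixed across variants, so the variants differ only in which practice summaries they include.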


4. Model Application to Real-world Data 


4.1. Methods 


We evaluate the models described above in modeling student performance in the Assistments
data used in the “moment of learning” work by Baker and colleagues (Baker et al., 2011). The
data contain first attempts by 4138 students on problem sets involving 54 knowledge components
(KCs), for a total of 187,309 first attempts. Each problem is coded with only a single KC. Each
KC was attempted between 89 and 16,200 times, and had an overall percent correct between
23% and 95%. The data are from the mastery learning “Skill Builder” feature of Assistments,
which allows teachers to set a threshold for the number of problems a student must correctly
answer in a row to be considered proficient. For this data set, the threshold was set at either 3 or
5.


This data set is sparse at the student level. First, the median number of KCs seen by each
student is 3, and 75% of students practice 7 or fewer different KCs. Second, the median number
of total attempts per student summing across all KCs is 20, and 435 students (11%) made 3 or
fewer total problem-solving attempts. This sparsity of data at the student level means that any
student effects in a model should be fit as random effects coming from a common distribution.
In this way, we ‘pool’ the data, so the student effects $\theta_i$ for students with less data shrink towards
the mean student effect. The ghost attempts necessarily have the greatest influence on practice
strings that are relatively short, i.e., they reduce the noise that would otherwise be present in Rijt
for these attempts.

There is a large number of students per KC; of the 54 KCs, only one is practiced by fewer
than 25 students, and the median number of students per KC is 410. However, the number of
attempts for each student on each KC is small, with a median of 4, and a mean of 8. For this
reason, we also treat all KC intercepts and slopes as random effects. We did not include the
covariance matrix for KC parameters in the model.

The number of students per KC makes student-stratified cross-validation unreliable, if not
entirely untenable. There are 54 KCs, but 27 of them are encountered by fewer than 410 students,
i.e., fewer than 10% of the students. Moreover, the sparsity is not uniform; which KCs were
attempted by particular students is not uniformly random. Omitting 10-20% of the students in
a cross-validation fold leads to omitting a number of KCs. Therefore 5-fold or 10-fold
cross-validation will result in very poor estimates of KC parameters, or an inability to use the model
to make predictions. Instead, AIC is used as the measure of model fit; as we demonstrate
later, AIC is at least as reliable as cross-validation, if not better.

We used the glmer function in the R package lme4 to fit all models listed in Table 1 (Bates
et al., 2013). Counting all of the different tunings of relevant decay weights, we fit a total of 111
models, though below we display only the results for the most illuminating comparisons. Data^1
and analysis code^2 are posted online.


4.2. Results and Discussion


Importance of Recency Weighting. We first compare PFA and AFM to models where
the prior practice representation is the recency-weighted count of successes Sijt (S-only) or the
recency-weighted proportion of successes Rijt (R-only), as shown in the accompanying figure. First, we
find that PFA outperforms AFM, replicating prior research (Pavlik et al., 2009; Chi et al., 2011). Second,
S-only with decay weight d = 1 outperforms AFM. With a decay parameter of 1, Sijt is simply
the total count of all prior successes for person i on KC j. Thus, a simple count of successes is
a better predictor of future success than the total count of practice.

Third, recent success is a better predictor of learning than the entire history of practice, 
since both S-only and R-only outperform AFM and PFA. Fourth, Rijt and Sijt have the same 
predictive value when the decay parameters are small, d < 0.3, but R becomes a more powerful 
predictor than S as the decay parameter increases. In other words, as the predictor includes more 
practice history, the proportion of successes becomes more valuable than the count of successes. 
With d = 0.9, AIC for R-only is 1200 less than AIC for S-only, a substantial difference. 


Importance of Failures and Total Practice. Given the baseline value of tracking
recency-weighted successes, which already outperforms PFA without recency weighting, what
is the value of additionally incorporating total practice or failed practice? Comparing R-only and
R-AFM enables us to judge the additional predictive value of amount of total practice compared
to recent success rate, and recency-weighted PFA (holding constant the decay weight for the
success and failure counts) allows us to judge the additional predictive value of failed practice
(see the accompanying figure). We find that adding a predictor for total amount of practice (R-AFM)
improves on the performance of R-only, but recency-weighted PFA with separate recent success and
recent failure counts produces even larger gains in predictive accuracy. The best model so far is PFA with
d = 0.6, indicating that the most relevant information for predicting future performance is contained
in the most recent 3-5 attempts, including separate counts of successes and failures.

Comparing Count versus Proportion of Successes. As seen in the earlier comparison, R alone is
a better predictor than S alone once d grows sufficiently large, even though they contain similar
information. This finding stands even after incorporating failure information (see the accompanying
figure). At each value of the decay parameter, R-PFA (with R and F) outperforms PFA (S and F), and once
again, the difference increases as the recency weight approaches 1.


Differential Recency of Successes and Failures. Starting with the PFA model,
we allow the decay weight for F to vary, while holding constant the decay weight for S at
d = 0.6, which was optimal in the R-only, S-only, and PFA models (see the accompanying figure).
We find that failure counts deserve the lowest possible decay weight, implying that only a failure on the
single most recent attempt contains relevant information for predicting future performance, and
prior failures are less informative. The result is the same for the R-PFA model, allowing the
decay weight for F to vary, while holding constant the decay weight for R at d = 0.6.

^1 https://sites.google.com/site/assistmentsdata/home/goldstein-baker-heffernan
^2 https://sites.google.com/site/aprilgalyardt/research

This comparison suggests that tuning the success rate and the failure rate separately offers a distinct
advantage. The final set of model comparisons verifies this finding. The final comparison figure shows
R-PFA models where the recency weight for R varies between 0.5 and 0.8, and the recency weight for
F ranges from 0.1 to 1.0. (Only the R-PFA models are shown, since the PFA models using S
performed uniformly worse than equivalent models using R.)

Of the models we compared on this dataset, the model with the highest predictive accuracy 
is R-PFA with recency weight d = 0.7 for the proportion of successes R, and recency weight 
d = 0.1 for the count of failures F. For successes, with d = 0.7, the weights of the last 6 actions 
are, respectively, {0.340, 0.238, 0.167, 0.117, 0.082, 0.057}. The 5 most-recent actions receive 
substantial weight, but 58% of the weight is on the two most-recent actions. In contrast, with 
d = 0.1, the weights for failures on the last 6 actions are {0.9, 0.09, 0.009, 0.0009, 0.00009, 0.000009}. 
Applying the weights, if the last action is incorrect, the decay-weighted $F \approx 0.9$, and if the 
last action is correct, $F \approx 0.1$. Thus, in this best-performing model, R acts like a running 
average over the last 2-5 actions, while F is effectively a binary indicator for whether the last 
action was correct or incorrect. 
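The quoted weights can be reproduced with a short calculation. The sketch below assumes the weights are geometric in the decay parameter and normalized to sum to one over a six-action window; that normalization is inferred from the quoted numbers rather than stated in this passage, and the function name `decay_weights` is ours.

```python
def decay_weights(d, n):
    """Geometric weights d**k for the n most recent actions (most recent first),
    normalized to sum to one. The normalization is an assumption inferred from
    the weights quoted in the text."""
    raw = [d ** k for k in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


success_weights = decay_weights(0.7, 6)   # weights behind R
failure_weights = decay_weights(0.1, 6)   # weights behind F

print([round(w, 3) for w in success_weights])
print([round(w, 6) for w in failure_weights])
print("share on two most recent actions:", round(sum(success_weights[:2]), 2))
```

Under this assumption the first list matches the success weights given above, and the two most recent actions carry about 58% of the total weight.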

The difference between the optimal tuning parameters for recent successes and recent failures 
may also be accounting for the difference in slips and guesses. If a student knows the KC 
and has been correctly responding, then $R \approx 1$ and $F \approx 0$. If this student then slips and 
responds incorrectly, with the optimal decay parameters, R will decrease to 0.7, and F jumps to 
0.9. If the incorrect answer was truly a slip then the student will likely answer correctly on the 
next attempt, so that R increases towards 1 again, and F falls back toward zero. (See also the 
figure panel “Student slips once”.) In this way, R is largely unaffected by slips, while F is an 
indicator that the last response may have been a slip. Now consider a student who does not know 
the KC, and has a history of incorrect practice attempts, so that $R \approx 0$ and $F \approx 1$. If this 
student then guesses correctly on an item, R only increases to 0.3, and F falls to 0.1. Here R is 
largely unaffected by the correct guess, while F is an indicator that the last response may have 
been a guess. 
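The slip and guess scenarios can be traced numerically. This is a minimal sketch under simplifying assumptions: the weight at lag k is taken to be (1 - d) d^k over a long history, and the ghost attempts used in the full R-PFA definition are ignored. The helper names are hypothetical.

```python
def decayed_proportion(indicators, d):
    """Weighted average of 0/1 indicators (oldest first) with weight
    (1 - d) * d**lag on the attempt `lag` steps back. Simplified: no ghost
    attempts and no renormalization for short histories."""
    return sum((1 - d) * d ** lag * x
               for lag, x in enumerate(reversed(indicators)))


def r_and_f(responses, d_r=0.7, d_f=0.1):
    """Return (R, F) for a 0/1 response history, using the best decay weights."""
    failures = [1 - x for x in responses]
    return decayed_proportion(responses, d_r), decayed_proportion(failures, d_f)


slip = [1] * 30 + [0]        # long run of successes, then one incorrect answer
guess = [0] * 30 + [1]       # long run of failures, then one correct answer
recovered = slip + [1]       # the slip is followed by a correct answer

r_slip, f_slip = r_and_f(slip)
r_guess, f_guess = r_and_f(guess)
r_rec, f_rec = r_and_f(recovered)

print(f"after slip:     R={r_slip:.2f}  F={f_slip:.2f}")
print(f"after guess:    R={r_guess:.2f}  F={f_guess:.2f}")
print(f"after recovery: R={r_rec:.2f}  F={f_rec:.2f}")
```

With these simplifications, R moves by at most 0.3 after a single surprising response, while F switches almost completely.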


4.3. Interpreting the Best-Performing Model 


The best overall model for predicting future success from a student’s history is R-PFA with 3 
parameters for each KC: $\beta_j$, the ‘easiness’ of the KC; $\rho_j$, the effect of recent failures 
with the KC; and $\delta_j$, the effect of recent successes with the KC. To examine model parameters 
in detail, consider that in a logistic regression model with random effects, the estimates of the 
coefficients may not be normally distributed when the data is sparse. This means that it may be 
inappropriate to use the estimated standard errors to obtain confidence intervals for the 
parameters. To address this issue, we re-fit the best-performing model using a Markov Chain 
Monte Carlo algorithm, as implemented in the MCMCglmm package in R (Hadfield, 2010). 

Figure 2: AIC scores for S-only, R-only, PFA and AFM on Assistments data. Models are labeled 
with the decay parameter in parentheses. Smaller AIC is better. The best model in this set of 
models is R-only d = 0.6, marked with a red triangle. 

Figure 3: AIC scores for R-AFM, R-only, PFA with equal S and F decays, and classic PFA on 
Assistments data. Models are labeled with the decay parameter in parentheses. Smaller AIC is 
better. The best model in this set of models is PFA d = 0.6, marked with a red triangle. 

Figure 4: AIC scores with differential success and failure decays on Assistments data. For 
success, best-so-far decay is d = 0.6 for R in R-PFA, and also d = 0.6 for S in PFA. Models are 
labeled with the decay parameter in parentheses. Smaller AIC is better. The best model in this 
set of models is R-PFA with success decay 0.6 and failure decay 0.1, marked with a red triangle. 

Figure 5: AIC scores for Assistments data. Models are labeled with the included predictors and 
the decay parameter in parentheses. Smaller AIC is better. The best model in this set of models is 
R(0.7), F(0.1), marked with a red triangle. This is the best-performing model overall. 

Figure 6: Estimates for KC parameters in the best-performing model: 
$\mathrm{logit}(p_{ijt}) = \theta_i + \beta_j + \rho_j F_{ijt} + \delta_j R_{ijt}$. The decay weight 
for F is 0.1, and for R is 0.7. For each KC, the dot indicates the posterior median for the 
parameter and the line indicates the 95% credible interval. KCs are ordered by their $\beta$ 
estimate with easier KCs at the top. 

We examine the 95% posterior credible intervals (CI) for each KC parameter (Figure 6). 
Recall that these were estimated with no restrictions on any coefficients. First, for 49 of the 
54 KCs, the $\delta$ coefficients for the effect of recent successes are significantly positive. The 
remaining 5 KCs have very wide CIs and are not significantly different from zero. These 5 KCs 
are among the easiest and hardest KCs, and were practiced by few students. In general, the more 
recent successes a student has had, the higher the probability of correctly responding to the next 
item, which corresponds to our intuitions about learning. 
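To make the role of each coefficient concrete, the following sketch evaluates the logistic form of the best-performing model: a student effect plus KC easiness, plus the effects of recent failures and recent successes. The parameter values are purely illustrative and are not estimates from the paper; the function name is ours.

```python
import math


def predict_correct(theta_i, beta_j, rho_j, delta_j, f_ijt, r_ijt):
    """Probability of a correct response under the logistic form
    logit(p) = theta_i + beta_j + rho_j * F + delta_j * R."""
    eta = theta_i + beta_j + rho_j * f_ijt + delta_j * r_ijt
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients for one student and one KC (illustration only).
params = dict(theta_i=0.0, beta_j=-0.5, rho_j=-0.4, delta_j=2.0)

p_low = predict_correct(f_ijt=0.9, r_ijt=0.1, **params)    # recent failure, few successes
p_high = predict_correct(f_ijt=0.0, r_ijt=0.9, **params)   # recent run of successes

print(f"few recent successes:  p = {p_low:.2f}")
print(f"many recent successes: p = {p_high:.2f}")
```

A positive $\delta_j$ means that a higher R raises the predicted probability, which is the pattern reported for 49 of the 54 KCs.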

We further examined the correlations among the KC parameters. The 95% CIs are 
$r(\beta, \rho) = (-0.62, -0.36)$, $r(\rho, \delta) = (0.32, 0.60)$, and $r(\beta, \delta) = (-0.49, -0.10)$. 
Notably, there is a significant negative correlation between KC easiness and the effect of recent 
failures; for relatively more difficult KCs, the effect of recent failure on predicting a correct 
response is positive, while for relatively easier KCs, the effect of recent failure is negative. 
With easy KCs, recent failure would predict subsequent failure for students who are not acquiring 
the KC, or who are engaged in non-productive behaviors, e.g., gaming the system. Interestingly, 
for difficult KCs, recent failure is positively associated with subsequent success. 

It has been previously documented in PFA and AFM that the slopes for the effect of the count 
of past failures $F_{ij}$ (and occasionally even for the count of past successes $S_{ij}$) are often 
negative, e.g., (Käser et al., 2014). Such negative slopes signal an area of concern (with the 
performance model itself or with the KC decomposition), because more practice, successful or 
unsuccessful, should increase the probability of a correct response. The R-PFA result that recent 
success is predictive of future success counters the negative-slope phenomenon. 


What is the source of R-PFA’s advantage over PFA? Although it is inappropriate 
to examine errors in prediction on a held-out or cross-validation set given the sparsity in our 
dataset, even comparing predictive accuracy on the training set is very illuminating. We present 
the difference in the predictions in figure 7. The two rows of the figure correspond to actually 
incorrect (top) and actually correct (bottom) outcomes. Each row is divided into 4 facets 
according to the value of the R predictor: 

• R in [0, 0.3] indicates that the student has produced either 1 or fewer right answers in the 
last 4 attempts, or is at the very beginning of practice. 

• R in (0.3, 0.5] indicates 2 correct answers in the last 3-4 attempts. 

• R in (0.5, 0.7] implies that the most recent 2 answers were correct. 

• R in (0.7, 1] means that at least the last 3 answers were correct. 

The X and Y axes indicate the predictions from PFA and R-PFA, respectively. This 
(x, y) position has a different meaning for the actually correct and actually incorrect outcomes. 
For example, the top-right quadrant for the actually incorrect outcomes indicates false positive 
values, due to both PFA and R-PFA wrongly predicting that the student will respond correctly. 
The top-right quadrant for the actually correct outcomes indicates true positive values, 
due to both PFA and R-PFA accurately predicting that the student will respond correctly. 

There are notable differences between the models in two cases, roughly corresponding to 
very early practice on a skill and to relatively late practice. First, when the true outcome is 
an incorrect response and the student has had few recent successes, R-PFA is much better at 
predicting these incorrect outcomes than PFA (top row, R in [0, 0.3], TN Win). This is most 
often the case when the attempt is after the second (note the bubble color); both PFA and 
R-PFA often wrongly predict a correct response for this R value when the attempt is the first or 
the second (top row, R in [0, 0.3], FP). This improvement in predicting when a student will fail 
to answer correctly is an important contribution of R-PFA for intelligent tutors and adaptive 
systems. 

Second, when the student has had successes on the most recent items, R-PFA is more likely 
to predict a correct outcome than PFA. This is true both when the true outcome is correct, 
and when it is not, i.e., when the incorrect outcome is likely a slip. Ultimately, the number 
of false positive losses for R-PFA (top row, R in (0.5, 0.7] or (0.7,1.0]) is much lower than 
the number of true positive wins (bottom row, same R). To an intelligent tutor, accurately 
predicting slips is arguably unimportant. An intelligent tutor using R-PFA rather than PFA 
would be more aggressive and more accurate at predicting student mastery of a skill, allowing 
students to graduate from practicing a skill more quickly than under PFA. 

When the student has had 2 correct answers in the last 3-4 attempts (R in (0.3, 0.5]), it is 
hard to know whether to expect a correct or an incorrect outcome. In the aggregate, PFA and 
R-PFA perform comparably in this case. 

5. Simulation Study 

The purpose of the simulation study was to compare R-PFA against other performance models 
without the limitations of real datasets. Two characteristics of the Assistments dataset examined 
above complicate a thorough model comparison. First, the data sparsity in the Assistments 
data precludes the use of cross-validation for model ranking. Nonetheless, cross-validation 
is a popular tool for model comparison because holding out data during model training 
can help prevent overfitting. Thus, we compare R-PFA to the other models both according to 
cross-validation and according to AIC. By using simulated data, we demonstrate that AIC is not 
only an appropriate model selection measure, but that it is better than cross-validation with the 
oft-used 0-1 loss function, and it can accommodate sparse data. 

Second, the stopping criterion used in the Skill Builder feature of Assistments leads to data 
missing non-randomly. Once Assistments determines that a student has mastered a skill, there 
are no further practice opportunities for the student on this skill. In fact, we expect that no 
mastery criterion is perfect, and even “mastered” students may have future incorrect practice. 
Thus, we would like to use data with evidence of post-mastery performance to train and evaluate 
our models. We demonstrate that decay weight d in the 0.6-0.8 range is optimal even when there 
is no stopping rule that affects data generation. 

5.1. Methods 

We first simulated data from the Bayesian Knowledge Tracing (BKT) model and two adaptations 
of BKT. We then compared the fit of seven logistic test models on each simulated data set. 
Classic BKT describes ‘ideal’ student behavior, which may not capture all student behavior. 
Our two adaptations of the BKT model address this by incorporating more realistic student 
behavior. Thus, we can compare the logistic models in the presence of less than ideal student 
responses. 

In one adaptation of the classic 2-state BKT model, we posit a 3-state BKT model. In classic 
BKT, there are two states: a learned state where the student has a high probability of correctly 
responding to a question, and an unlearned state where the student has a low probability of 
correctly responding. In the 3-state BKT model, the states are unlearned, practicing, and fluent. In 
the unlearned state, students have a very low probability of correctly responding to a question. 
In the fluent state, students have a very high probability of correctly responding. In the 
practicing state, students are learning the KC, but their understanding is not complete, so they have 
only moderate probabilities of a correct response. When generating data, this specification will 
produce more interwoven sequences of 0’s and 1’s than the 2-state BKT model. For example, we 
will see more patterns like $X_{ij} = (0,0,1,0,1,0,1,1,0,1,1)$, rather than primarily patterns of 
the form $X_{ij} = (0,0,0,0,1,1,1,1)$. In this way, 3-state BKT incorporates realistic “struggling” 
students’ behavior. This 3-state model only serves a generative purpose; data generated from 
the 3-state model could also be fit by the 2-state model. 

In the BKT+FS adaptation to the BKT model, we vary the behavior of different simulated 
students. We include a small proportion of students who occasionally engage in unproductive 
learning behavior that produces long strings of incorrect responses. In real datasets, such data 
may be produced by various causal mechanisms, e.g., by lacking mastery of a prerequisite KC, 
by abusing hints (Aleven and Koedinger, 2000), or by gaming the system (Baker et al., 2004). 
Specifically, a fraction of students may be likely to generate long strings of incorrect responses; 
the probability of being such a student is 0.08. These students then generate long strings of 
incorrect responses with a probability that varies by KC. When a student is engaged in this 
behavior, the probability of a correct response is 0.02; when a student is not engaged in this 
behavior, data is generated from the usual two-state BKT model. We refer to this model as BKT 
with failure sequences (BKT+FS). 

The data size in each simulation is near the size of the Assistments dataset, with 
50 KCs and 3500 students. Each student practiced a random number of KCs, generated by 
a Poisson distribution with a mean of 5. The number of opportunities for a student to practice 
each KC also varied randomly, generated by a Poisson distribution with a mean of 8. This means 
the number of opportunities for practice is statistically independent of the KC. This eliminates 
the uneven sparsity observed in the real data set. Uniform sparsity makes it possible to use 
cross-validation over students, and to compare cross-validation to AIC. 

For each of the three variations of BKT (classic and two adaptations), we ran 100 simulations. 
For each of the 300 simulated data sets, we fit 7 models: AFM, PFA with no decay, and 
R-PFA with 5 different values of the decay weight for the weighted proportion of successes: 
d = 0.2, 0.4, 0.6, 0.8, 1.0. For the count of failures, we fixed the decay weight d = 0.1, since the 
smallest decay parameter for failures was always optimal for the Assistments data. Complete 
details for each of the BKT variations are provided in appendix A, and code is posted online 
(https://sites.google.com/site/aprilgalyardt/research). 

To allow model evaluation with cross-validation, we modified the 7 models in two ways from 
the ones that we fit on Assistments data: we omitted the student effect, and we used fixed rather 
than random effects for the KC parameters. First, omitting student effects allows us to make 
predictions for new students. It also implicitly assumes that any new student for whom we will 
be generating predictions is of average ability. Second, the uneven sparsity in the Assistments 
data necessitated random effects for KCs and students, but when there is sufficient data at each 
level, the estimates for random effects from the R function glmer and fixed effects from the R 
function glm, used here, will be effectively the same. 


Measures of Model Fit. On each simulated data set, we compared the 7 models using 3 
different measures of model fit: AIC, 5-fold cross-validation (CV) with 0-1 loss, 
and 5-fold CV with L1 prediction error loss (PE loss). We omitted the comparison 
to 0-1 loss with a single test set (i.e., 1-fold CV), because that measure is noisy and unreliable 
due to high variance of the estimate (James et al., 2013). 
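The difference between the two cross-validation losses can be illustrated with a small example. The sketch assumes the usual definitions: 0-1 loss counts a prediction as wrong when the predicted probability, thresholded at 0.5, disagrees with the outcome, and L1 prediction error loss is the mean absolute difference between outcome and predicted probability. The data are made up for illustration.

```python
def zero_one_loss(outcomes, probs, threshold=0.5):
    """Fraction of responses whose thresholded prediction is wrong."""
    wrong = sum(1 for x, p in zip(outcomes, probs)
                if (p >= threshold) != (x == 1))
    return wrong / len(outcomes)


def prediction_error_loss(outcomes, probs):
    """Mean absolute (L1) difference between outcome and predicted probability."""
    return sum(abs(x - p) for x, p in zip(outcomes, probs)) / len(outcomes)


outcomes = [1, 1, 0, 1, 0]                      # hypothetical held-out responses
hesitant = [0.55, 0.60, 0.45, 0.55, 0.40]       # right side of 0.5, low confidence
confident = [0.90, 0.95, 0.10, 0.85, 0.05]      # right side of 0.5, high confidence

for name, probs in [("hesitant", hesitant), ("confident", confident)]:
    print(name,
          "0-1 loss:", zero_one_loss(outcomes, probs),
          "PE loss:", round(prediction_error_loss(outcomes, probs), 2))
```

Both sets of predictions have the same 0-1 loss, yet the PE loss separates them. This is one reason 0-1 loss can fail to distinguish models that differ mainly in how confident their correct predictions are.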


5.2. Results and Discussion 

Model rankings for the simulations by each of the goodness-of-fit measures are displayed in 
figures 8, 9, and 10. 

Cross-validation with 0-1 Loss has High Variance. When two models have a very 
similar fit to the data, measures of model ranking might reasonably diverge in which one they 
rank as slightly better. For example, in the two-state BKT model in figure 8, AIC ranks 
R-PFA(d = 0.6) as the best model in about 80% of the simulations, and ranks R-PFA(d = 0.8) 
in second place. In contrast, CV-PE ranks R-PFA(d = 0.6) as the best model only about 40% of 
the time, and puts R-PFA(d = 0.8) in first place in 60% of the simulations. This is the kind of 
behavior we expect when two models are similar. Both measures clearly agree that these 
are the best two models compared to the rest of the available models. 

By contrast, in about 40% of the simulations, CV 0-1 ranks PFA as the best model, and 
in another 40% of the simulations it ranks PFA in 6th place. This implies that either the PFA 
regression has highly unstable performance, or that cross-validation with 0-1 loss is an unreliable 
metric. However, CV with PE loss and AIC are very reliable; they always rank AFM in 7th place, 
PFA in 6th place, and R-PFA R(0.2), F(0.1) in 5th place. 

Moreover, CV with 0-1 loss also fails to reliably rank R-PFA models with the varying decay 
parameters, while CV with PE loss and AIC do rank them reliably. R-PFA with d = 0.6 and d = 
0.8 are the highest-ranked (and both very close to the best-performing model on the Assistments 
data, which had d = 0.7). R-PFA with bandwidths of 0.4 and 1.0 are not as good. These all 
outrank R-PFA with d = 0.2, PFA and AFM in 100% of the simulations. 

The three measures of model fit do not have equal discriminating power. CV with PE loss 
and AIC rank the models reliably, and they also largely agree with each other. Cross-validation 
with 0-1 loss is an unreliable measure of model fit. 

Model Rankings. In all 3 simulation conditions (classic BKT, 3-state BKT, and BKT+FS), 
in 100% of the simulations, the R-PFA models had higher predictive accuracy (judged by AIC 
and cross-validation with PE loss) than PFA or AFM. For all 3 conditions, the best model was 
R-PFA with a decay parameter of 0.6 or 0.8. 

In the two-state BKT simulation, AIC and CV-PE rank the models in the same order. CV 
0-1 produces ambiguous and unreliable model rankings. R-PFA with decay parameters of 0.6 
and 0.8 are the best. R-PFA with bandwidths of 0.4 and 1.0 are not as good. The bandwidth of 
d = 0.2 is ranked as the worst R-PFA model in 100% of the simulations, with PFA in 6th place 
100% of the time, and AFM in last place 100% of the time. 

In the three-state BKT simulation, once again, CV with 0-1 loss produces ambiguous and 
unreliable model rankings. According to AIC and CV-PE, R-PFA(d = 0.6) is ranked best 100% 
of the time, with d = 0.4 most often in second place and d = 0.8 most often in third place. PFA 
and AFM are again in 6th and 7th place respectively. 

Finally, in the BKT+FS simulation, CV with 0-1 loss agrees with the other measures. R-PFA 
with d = 0.6 or d = 0.8 are ranked as the best models, while PFA is ranked in 6th place in 100% 
of the simulations by all three measures. The presence of long strings of incorrect answers, 
which are produced occasionally by 8% of the students, makes the total number of practice attempts 
and the total number of failures very poor predictors of future success. But because R-PFA only 
considers recent history, it is not affected by these patterns. A student who was incorrect on the 
last couple of opportunities for any reason is estimated to have a low probability of responding 
correctly on the next opportunity. 

[Figure 8 plot. x-axis (Model): AFM; PFA, S(1), F(1); R-PFA, R(0.2), F(0.1); R-PFA, R(0.4), F(0.1); 
R-PFA, R(0.6), F(0.1); R-PFA, R(0.8), F(0.1); R-PFA, R(1.0), F(0.1). y-axis: proportion. 
Legend (Measure): AIC; CV, 0-1 loss; CV, PE loss.] 

Figure 8: Model rankings in the two-state BKT simulation. The rows of graphs indicate the 
model rankings, e.g., “Rank 3” indicates that the model was ranked as the third best model in 
a particular simulation. The height of each bar shows the proportion of simulations where the 
measure ranked each model. Taller bars at the top of the graph indicate that a model was ranked 
as a better model. 

[Figure 9 plot. x-axis (Model): same seven models as in Figure 8. y-axis: proportion.] 

Figure 9: Model rankings for each measure in the three-state BKT simulation. The rows of graphs 
indicate the model rankings, e.g., “Rank 3” indicates that the model was ranked as the third best 
model in a particular simulation. The height of each bar shows the proportion of simulations 
where the measure ranked each model. Taller bars at the top of the graph indicate that a model 
was ranked as a better model. 

[Figure 10 plot. x-axis (Model): same seven models as in Figure 8. Legend (Measure): AIC; 
CV, 0-1 loss; CV, PE loss.] 

Figure 10: Model rankings for each measure in the BKT+FS simulation. The rows of graphs 
indicate the model rankings, e.g., “Rank 3” indicates that the model was ranked as the third best 
model in a particular simulation. The height of each bar shows the proportion of simulations 
where the measure ranked each model. Taller bars at the top of the graph indicate that a model 
was ranked as a better model. 

The consistent model rankings across all 3 simulation conditions indicate that recent history 
is a better predictor of learning than a student’s full history. The optimal decay parameter range 
is consistently 0.6-0.8. Thus, the last 3-5 practice opportunities contain sufficient information to 
judge whether or not a student has learned the KC. 

Simulating from the two- or three-state BKT model offers a best-case scenario for the AFM 
and PFA models. In a BKT model with no forgetting, the more opportunities that a student has 
to practice, the more likely it is that a student will transition from the unlearned state to the 
learned state. Therefore, on average, the total number of opportunities to practice should be 
proportional to the probability of a correct response. Yet even in this case, R-PFA makes better 
predictions than PFA. 

When realistic student behavior is added in the BKT+FS simulation, the advantage of 
R-PFA over PFA becomes even more distinct. The analysis on Assistments data reveals two ways 
in which the predictions of R-PFA differ from the predictions of PFA: First, R-PFA is better at 
predicting incorrect answers. Second, the difference between 0-1 loss and prediction error loss 
indicates that R-PFA has higher confidence in its accurate predictions. 

6. Conclusions 

The primary contributions of this work are: 

• the R-PFA model itself, and its publicly available implementation 

• the comparison of R-PFA to published models and novel baselines on real-world and 
simulated data, which demonstrates how a student’s recent performance history evidences 
whether or not they have acquired a particular knowledge component 

• the novel visualizations comparing PFA and R-PFA performance, which facilitate logistic 
regression diagnostics and reveal the source of R-PFA’s improvement over alternative 
models 

• the comparison of measures of model performance, which demonstrates how cross-validation 
with 0-1 loss is inferior to AIC and to cross-validation with L1 prediction error loss 

On R-PFA. R-PFA leverages prior work, including the separation of student and item 
characteristics (IRT), the grouping of items by skill and the significance of past performance (AFM), 
the separation of prior successful and unsuccessful practice (PFA), and discounting of older 
evidence by Gong et al. 

R-PFA adds several novel insights to this model evolution. First, a decay-weighted proportion 
of successes is a better predictor than a decay-weighted count of successes. Second, decay 
weights should be tuned, rather than determined heuristically (as in the work by Gong et al.). 
Third, decay weights for successes and failures should be tuned separately. Fourth, it is 
reasonable and effective to inform the model with a prior “belief” that students who have never 
attempted the skill will likely fail to answer correctly, e.g., using ghost attempts. 

In aggregate, these insights lead to improvements in predictive accuracy in the true negative 
rate when recent history contains few correct attempts, and in the true positive rate when recent 
history mostly consists of correct attempts. 

The optimal amount of recent history for modeling is consistent across all of the simulations 
and the Assistments data; the best decay parameter for recent successes is consistently 
$d \in \{0.6, 0.7, 0.8\}$. With these decay rates, 75-93% of the weight is on the last 5 attempts, 
and 55-78% of the weight is on the last 3 attempts. Thus, empirically, these last 3-5 attempts 
contain sufficient information about the student’s knowledge state to make accurate predictions. 
Interestingly, this d supports the heuristic, implemented in some adaptive learning systems 
(Heffernan and Heffernan, 2014), that a student has mastered a skill if a student responds correctly 
to 3-5 questions in a row on the skill. However, the R predictor is a kind of average that does 
not require an unbroken streak of correct attempts. 

R-PFA is relatively simple mathematically, adding only two tuning weights beyond the 
parameter structure of PFA. The stability of the decay weights identified in this work implies these 
weights might be reasonably treated as fixed in new uses of R-PFA, further reducing the complexity 
of R-PFA. R-PFA is more realistic than PFA, because it distinguishes the predictive value 
of recent performance and older performance. Its predictor and parameter values are easily 
interpreted, even in the presence of student behaviors that are undesirable or unproductive for 
learning. Finally, its predictive accuracy improves on PFA. Thus, our ultimate assessment of 
R-PFA is that it is preferable to PFA and other logistic models in many circumstances where 
such a model might be used. 

The findings here cast doubt on the validity of the AFM model, because a non-decayed count 
of successes only, i.e., S-only with d = 1, outperforms AFM. At present, AFM has uses aside 
from prediction, including in skill model selection in Learning Factors Analysis (Cen et al., 
2006b), which may need to adopt different models. 


We did not compare the predictive accuracy of R-PFA to BKT (although BKT and 
PFA often have similar accuracy), but R-PFA has a strength here as well. Knowledge Tracing is 
a rather complex Hidden Markov Model. It offers a plausible generative structure for student 
performance, but notoriously can fail to return interpretable or accurate estimates of parameters 
(e.g., Beck and Chang, 2007). The allure of BKT is the posterior updating of the probability that 
a student knows the KC after each practice opportunity. R-PFA accomplishes the same thing in 
a simpler way. 

Nonetheless, the R-PFA project is by no means complete. In the future, we will consider 
the relationship of R-PFA to Bayesian Knowledge Tracing, because preliminary work suggests 
that the two models reveal interesting aspects in each other. We will consider how R-PFA may 
incorporate richer Q-matrices (multiple skills per item), and how R-PFA may be used to improve 
cognitive models. We will also extend the R-PFA model with additional predictors, as informed 
by the comparison to PFA reported above. 


On Methods. This work brings to bear several methodological strengths in terms of model 
comparison, model fitting, and model analysis. 

We evaluated models on both real-world and simulated data. Although simulated data 
evaluations are rare in the educational data mining literature, they are very popular in statistics. In 
fact, we argue that real-world datasets have sparse data properties that necessitate both kinds of 
comparisons. 

The model fitting used random effects for all model parameters for students and knowledge 
components, for both intercepts and slopes. This was necessitated by the sparsity in the 
Assistments dataset. The random effects were used in both R-PFA and alternative models for a fair 
comparison. 

Our model analysis provides evidence that cross-validation with 0-1 loss, an immensely 
popular metric in educational data mining, is a poor choice for model comparison. Instead, we 
argue for the use of AIC as a measure of model fit. AIC is seen to be equivalent to cross-validation 
with an L1 prediction error loss. This equivalency is a known result (e.g., Wasserman, 2004), 
but we demonstrate that this holds even when cross-validation is making predictions for new 
students. Any divergence in model choice between AIC and cross-validation is due to normal 
sampling variation, and usually indicates that the fit of the models is similar. A severe divergence 
in agreement (which we did not see in our simulation) may indicate that the sample size is 
too small for the model complexity. We note that these conclusions about AIC extend to the 
Bayesian information criterion (BIC), save that BIC has a higher penalty for model complexity. 
Cross-validation can be a computationally and time intensive process. AIC and BIC offer a 
faster and simpler equivalent alternative. 
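For reference, both criteria follow directly from the fitted log-likelihood, which is why they are cheap compared with refitting a model on every fold. The sketch below uses the standard definitions AIC = 2k - 2 log L and BIC = k log(n) - 2 log L for a Bernoulli response. The predicted probabilities and parameter counts are invented for illustration, and counting effective parameters in a random-effects model is more subtle than shown here.

```python
import math


def bernoulli_log_likelihood(outcomes, probs):
    """Log-likelihood of 0/1 outcomes under predicted success probabilities."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p)
               for x, p in zip(outcomes, probs))


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k log(n) - 2 log L (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik


outcomes = [0, 0, 1, 0, 1, 1, 1, 1]                     # hypothetical responses
probs = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]        # hypothetical fitted values

log_lik = bernoulli_log_likelihood(outcomes, probs)
print("log-likelihood:", round(log_lik, 3))
print("AIC with 3 parameters:", round(aic(log_lik, 3), 3))
print("BIC with 3 parameters:", round(bic(log_lik, 3, len(outcomes)), 3))
```

With 8 or more observations, log(n) exceeds 2, so BIC penalizes each additional parameter more heavily than AIC does.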


A Simulation Details 

A1. Two-State BKT Simulation 

This is the usual BKT model: a hidden Markov model with an unlearned and a learned state. 

• Knowledge components are indexed $j = 1, \ldots, K$. 

• Students are indexed by $i$. 

• Student $i$’s response on the $t$-th opportunity to practice KC $j$: 

$$X_{ijt} = \begin{cases} 0 & \text{if incorrect} \\ 1 & \text{if correct} \end{cases}$$ 

• Denote student $i$’s unobserved knowledge of KC $j$ on the $t$-th opportunity as 

$$Z_{ijt} = \begin{cases} 1 & \text{if unlearned state} \\ 2 & \text{if learned state} \end{cases}$$ 

• Probability of initially knowing KC $j$: $\Pr(Z_{ij1} = 2) = \pi_j \sim \mathrm{Beta}(1, 2)$. 

This distribution has positive probability for all values on the interval [0,1], but is centered 
at a mean of $E[\pi_j] = 1/3$. This encapsulates our expectation that in a well-targeted 
educational intervention, most of the students would not already know the majority of topics 
which will be taught. The density is shown in figure 11. 

• Transition matrices for the Markov process are 

$$\begin{pmatrix} 1 - L_j & L_j \\ 0 & 1 \end{pmatrix}$$ 

$L_j$ is the probability of learning KC $j$ following a practice attempt, generated according 
to $L_j \sim \mathrm{Beta}(2, 2)$. 

This distribution has positive probability for all values on the interval [0,1], but is centered 
at $E[L_j] = 1/2$. If $L_j$ is near 1, then a student has a high probability of learning the skill 
after a single practice attempt. In the same way, if $L_j$ is near 0, then a student has a 
low probability of learning the skill, regardless of how much they practice. This Beta 
distribution places more probability near 0.5, and lower probability near 0 or 1, reflecting 
the idea that most students need to practice KCs a couple of times before they learn them. 
The density is shown in figure 12. 

Figure 11: Density of Beta(1, 2) distribution. 

Figure 12: Density of Beta(2, 2) distribution. 


• Probability of a correct answer in the unlearned state (guessing): Cuj ~ Uni f {0.02, 0.3) 

= ^\Zijt = 1 ) = Cuj 

• Probability of a correct answer in the learned state (1-slip): Cij ~ Uni f {0.7, 0.98) 

P'r'{Xijt = 7\Zijt = 2 ) = Cij 


28 




• The average number of KCs seen by each student is fixed at K.n = 5.

• The number of KCs seen by student $i$ is generated as $J_i \sim \min(K, \mathrm{Poisson}(\text{K.n}))$.

• The KCs that student $i$ answers are drawn without replacement from $\{1, \ldots, K\}$.

• T.avg = 8 is the average number of practice opportunities for any student on any KC.

• The number of practice opportunities for student $i$ on KC $j$ is $O_{ij} \sim \max(\mathrm{Poisson}(\text{T.avg}), 2)$,
so that if a student practiced a KC, they practiced it at least twice.
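
The generative process listed above can be sketched in code. What follows is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary keys, the `n_students` and `n_kcs` arguments, and the hand-written Poisson sampler are all introduced here for illustration, and the sketch assumes that the response at an opportunity is generated before the learning transition is applied.

```python
import math
import random


def poisson(rng, lam):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_kc_params(rng):
    """Draw the per-KC parameters of the two-state model."""
    return {
        "pi": rng.betavariate(1, 2),                # P(initially learned)
        "learn": rng.betavariate(2, 2),             # L_j
        "guess": rng.uniform(0.02, 0.3),            # C_uj
        "correct_learned": rng.uniform(0.7, 0.98),  # C_lj
    }


def simulate_responses(rng, params, n_opportunities):
    """Generate one student's 0/1 responses on one KC."""
    learned = rng.random() < params["pi"]
    responses = []
    for _ in range(n_opportunities):
        p_correct = params["correct_learned"] if learned else params["guess"]
        responses.append(1 if rng.random() < p_correct else 0)
        # Assumed ordering: learning may occur after the attempt.
        if not learned and rng.random() < params["learn"]:
            learned = True
    return responses


def simulate_dataset(n_students, n_kcs, k_n=5, t_avg=8, seed=0):
    """Return (per-KC parameters, {(student, kc): responses})."""
    rng = random.Random(seed)
    kc_params = [sample_kc_params(rng) for _ in range(n_kcs)]
    data = {}
    for i in range(n_students):
        n_seen = min(n_kcs, poisson(rng, k_n))          # J_i
        for j in rng.sample(range(n_kcs), n_seen):      # without replacement
            n_opp = max(poisson(rng, t_avg), 2)         # O_ij, at least 2
            data[(i, j)] = simulate_responses(rng, kc_params[j], n_opp)
    return kc_params, data
```

Calling `simulate_dataset(50, 10)` would produce response sequences for 50 hypothetical students over 10 hypothetical KCs; fixing `seed` makes a run repeatable.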

A1.1. Three-State BKT simulation 

The 3-state BKT model uses the states unlearned, practicing, and fluent. Students in the
unlearned state have a low probability of answering correctly. Students in the practicing state have
moderate probabilities of answering correctly. We may think of students in this state as largely 
understanding the ideas and knowing what to do, but slipping frequently perhaps due to high 
working memory loads or other causes. Students in the fluent state have a very high probability 
of answering correctly. 


• Student $i$'s response on the $t$-th opportunity to practice KC $j$:

$$X_{ijt} = \begin{cases} 0 & \text{if incorrect} \\ 1 & \text{if correct} \end{cases}$$


• Denote student $i$'s unobserved knowledge of KC $j$ on the $t$-th opportunity as

$$Z_{ijt} = \begin{cases} 1 & \text{if unlearned state} \\ 2 & \text{if practicing state} \\ 3 & \text{if fluent state} \end{cases}$$


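• Probability for initial states: $\pi_j = (\pi_{j1}, \pi_{j2}, \pi_{j3})$.

$$\begin{aligned}
P(Z_{ij0} = 1) &= \pi_{j1} \sim \mathrm{Beta}(2, 2) \\
P(Z_{ij0} = 2) &= \pi_{j2} = 1 - \pi_{j1} \\
P(Z_{ij0} = 3) &= \pi_{j3} = 0
\end{aligned}$$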


This distribution for $\pi_j$ assumes that no student begins practice in the fluent state, so that
practice will benefit all students. The Beta(2, 2) distribution is shown in Figure 12. $\pi_{j1}$
can take any value between 0 and 1, but it is more likely to take values nearer to 0.5. This
simulates the idea that for an average KC approximately half the students will start out not
knowing the KC at all, and the other half of the students need more practice.

• Transition matrices for the Markov process are

$$\begin{pmatrix} L_{j11} & 1 - L_{j11} & 0 \\ 0 & L_{j22} & 1 - L_{j22} \\ 0 & 0 & 1 \end{pmatrix}$$

where

$$L_{j11}, L_{j22} \sim \mathrm{Beta}(2, 2).$$

With these transition matrices, a student cannot transition directly from the unlearned
to the fluent state over a single opportunity. However, since this is a first-order Markov
process, it is possible and fairly likely that some students will transition from unlearned to
fluent within 2 practice opportunities.

• Probability of a correct answer in the unlearned state (guessing): $C_{uj} \sim \mathrm{Unif}(0.02, 0.2)$

$$\Pr(X_{ijt} = 1 \mid Z_{ijt} = 1) = C_{uj}$$

• Probability of a correct answer in the practicing state: $C_{pj} \sim \mathrm{Unif}(0.4, 0.7)$

$$\Pr(X_{ijt} = 1 \mid Z_{ijt} = 2) = C_{pj}$$

• Probability of a correct answer in the fluent state (1-slip): $C_{fj} \sim \mathrm{Unif}(0.85, 1)$

$$\Pr(X_{ijt} = 1 \mid Z_{ijt} = 3) = C_{fj}$$

• The average number of KCs seen by each student is fixed at K.n = 5.

• The number of KCs seen by student $i$ is generated as $J_i \sim \min(K, \mathrm{Poisson}(\text{K.n}))$.

• The KCs that student $i$ answers are drawn without replacement from $\{1, \ldots, K\}$.

• T.avg = 8 is the average number of practice opportunities for any student on any KC.

• The number of practice opportunities for student $i$ on KC $j$ is $O_{ij} \sim \max(\mathrm{Poisson}(\text{T.avg}), 2)$,
so that if a student practiced a KC, they practiced it at least twice.
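
The three-state process can be sketched in the same spirit. This is an illustrative sketch rather than the authors' code: all function names and dictionary keys are hypothetical, the lower bound 0.4 for the practicing state is read from a damaged source line, and the response is assumed to be generated before the state transition. The helper `two_step` multiplies the transition matrix by itself, which shows that the unlearned-to-fluent probability is zero in one step and $(1 - L_{j11})(1 - L_{j22})$ in two.

```python
import random


def sample_kc_params(rng):
    """Draw the per-KC parameters of the three-state model."""
    stay_unlearned = rng.betavariate(2, 2)    # L_j11
    stay_practicing = rng.betavariate(2, 2)   # L_j22
    pi1 = rng.betavariate(2, 2)               # P(start unlearned)
    return {
        "initial": [pi1, 1.0 - pi1, 0.0],     # nobody starts fluent
        "transition": [
            [stay_unlearned, 1.0 - stay_unlearned, 0.0],
            [0.0, stay_practicing, 1.0 - stay_practicing],
            [0.0, 0.0, 1.0],
        ],
        "p_correct": [
            rng.uniform(0.02, 0.2),   # C_uj, unlearned (guessing)
            rng.uniform(0.4, 0.7),    # C_pj, practicing (0.4 is assumed)
            rng.uniform(0.85, 1.0),   # C_fj, fluent (1 - slip)
        ],
    }


def simulate_responses(rng, params, n_opportunities):
    """Generate one student's 0/1 responses on one KC."""
    states = [0, 1, 2]  # unlearned, practicing, fluent
    state = rng.choices(states, weights=params["initial"])[0]
    responses = []
    for _ in range(n_opportunities):
        p_correct = params["p_correct"][state]
        responses.append(1 if rng.random() < p_correct else 0)
        state = rng.choices(states, weights=params["transition"][state])[0]
    return responses


def two_step(matrix):
    """Return the two-step transition matrix (matrix times itself)."""
    n = len(matrix)
    return [[sum(matrix[a][m] * matrix[m][b] for m in range(n))
             for b in range(n)] for a in range(n)]
```

Because both $L_{j11}$ and $L_{j22}$ have mean 1/2 and are drawn independently, the two-step unlearned-to-fluent probability averages 1/4 across KCs, which is consistent with the remark that such transitions are fairly likely.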

A1.2. BKT+FS simulation

The second adaptation to the familiar 2-state BKT model includes a small proportion of students
who occasionally engage in unproductive learning behavior, which produces long strings of
incorrect responses, or failure sequences. This behavior might appear for many different reasons,
such as hint abuse or other gaming behaviors, or the student simply lacking a key prerequisite KC.
As a shorthand, we shall refer to students who engage in this behavior as FS-students.

On each KC, the FS-students will have a probability of engaging in the FS behavior for that 
KC. The probability that these students will engage in the behaviors depends on the KC, not the 
student. Whether a student ever engages in the FS-behavior depends on the student. When a 
student does so depends on the KC. 

When a student does engage in FS-behavior on a KC, their responses on that KC are, with high
probability, a string of primarily incorrect answers. When a student is not engaging in
FS-behavior, data is generated according to an unmodified two-state BKT model.

• Student $i$'s response on the $t$-th opportunity to practice KC $j$:

$$X_{ijt} = \begin{cases} 0 & \text{if incorrect} \\ 1 & \text{if correct} \end{cases}$$
• Denote student $i$'s unobserved knowledge of KC $j$ on the $t$-th opportunity as

$$Z_{ijt} = \begin{cases} 1 & \text{if unlearned state} \\ 2 & \text{if learned state} \end{cases}$$

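• Probability for initial states: $\pi_j = (\pi_{j1}, \pi_{j2})$. Note that $\mathbb{E}[\pi_{j1}] = 1/3$, and the distribution is
shown in Figure 11.

$$\begin{aligned}
P(Z_{ij0} = 1) &= \pi_{j1} \sim \mathrm{Beta}(1, 2) \\
P(Z_{ij0} = 2) &= \pi_{j2} = 1 - \pi_{j1}
\end{aligned}$$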

• Transition matrices for the Markov process are

$$\begin{pmatrix} 1 - L_j & L_j \\ 0 & 1 \end{pmatrix}$$

$L_j$ is the probability of learning KC $j$ following a practice attempt, generated according
to $L_j \sim \mathrm{Beta}(2, 2)$ (Figure 12).

• Probability of a correct answer in the unlearned state (guessing): $C_{uj} \sim \mathrm{Unif}(0.02, 0.3)$

$$\Pr(X_{ijt} = 1 \mid Z_{ijt} = 1) = C_{uj}$$

• Probability of a correct answer in the learned state (1-slip): $C_{lj} \sim \mathrm{Unif}(0.7, 0.98)$

$$\Pr(X_{ijt} = 1 \mid Z_{ijt} = 2) = C_{lj}$$




• To simulate the FS-behavior: 

- For each student, draw the indicator $G_i$ for whether student $i$ engages in the
FS-behavior, $G_i \sim \mathrm{Bernoulli}(0.08)$.

- For each KC $j$, draw a probability that one of the FS-students will engage in this
behavior on this KC, $B_j \sim \mathrm{Uniform}(0, 1)$.

- Draw an indicator for whether student $i$ will engage in this behavior on KC $j$:

$$W_{ij} \mid G_i = 1 \sim \mathrm{Bernoulli}(B_j)$$

$$W_{ij} \mid G_i = 0 \;=\; 0$$

- If $W_{ij} = 0$, then generate $X_{ij}$ from the 2-state BKT model.

- If $W_{ij} = 1$, then for $t = 1, \ldots, T_{ij}$, $X_{ijt} \mid W_{ij} = 1 \sim \mathrm{Bernoulli}(0.2)$.

• The average number of KCs seen by each student is fixed at K.n = 5.

• The number of KCs seen by student $i$ is generated as $J_i \sim \min(K, \mathrm{Poisson}(\text{K.n}))$.

• The KCs that student $i$ answers are drawn without replacement from $\{1, \ldots, K\}$.

• T.avg = 8 is the average number of practice opportunities for any student on any KC.

• The number of practice opportunities for student $i$ on KC $j$ is $O_{ij} \sim \max(\mathrm{Poisson}(\text{T.avg}), 2)$,
so that if a student practiced a KC, they practiced it at least twice.
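
The failure-sequence mechanism can be illustrated with a short sketch. It is an assumption-laden illustration, not the authors' implementation: the function names, dictionary keys, and the `fs_correct` argument are hypothetical, and the two-state component assumes the response is generated before the learning transition. Following the list above, the initial-state draw treats the Beta(1, 2) variable as the probability of starting in the unlearned state.

```python
import random


def sample_kc_params(rng):
    """Draw the per-KC parameters, including the FS probability B_j."""
    return {
        "p_unlearned": rng.betavariate(1, 2),       # pi_j1 = P(Z_ij0 = 1)
        "learn": rng.betavariate(2, 2),             # L_j
        "guess": rng.uniform(0.02, 0.3),            # C_uj
        "correct_learned": rng.uniform(0.7, 0.98),  # C_lj
        "fs_prob": rng.uniform(0.0, 1.0),           # B_j
    }


def simulate_two_state(rng, params, n_opportunities):
    """Unmodified two-state BKT responses for one student on one KC."""
    learned = rng.random() >= params["p_unlearned"]
    responses = []
    for _ in range(n_opportunities):
        p_correct = params["correct_learned"] if learned else params["guess"]
        responses.append(1 if rng.random() < p_correct else 0)
        # Assumed ordering: learning may occur after the attempt.
        if not learned and rng.random() < params["learn"]:
            learned = True
    return responses


def simulate_student_kc(rng, is_fs_student, params, n_opportunities,
                        fs_correct=0.2):
    """Return (W_ij, responses) for one student on one KC."""
    # W_ij is 0 for non-FS students and Bernoulli(B_j) for FS-students.
    engages = bool(is_fs_student and rng.random() < params["fs_prob"])
    if engages:
        # Failure sequence: each response is correct with probability 0.2.
        responses = [1 if rng.random() < fs_correct else 0
                     for _ in range(n_opportunities)]
        return True, responses
    return False, simulate_two_state(rng, params, n_opportunities)


rng = random.Random(0)
params = sample_kc_params(rng)
is_fs_student = rng.random() < 0.08   # G_i ~ Bernoulli(0.08)
engaged, responses = simulate_student_kc(rng, is_fs_student, params, 8)
```

Drawing `is_fs_student` once per student and `fs_prob` once per KC mirrors the description that whether a student ever engages in the behavior depends on the student, while when they do so depends on the KC.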